# Multimodal Retrieval

| Model | License | Description | Tags | Author | Downloads | Likes |
|---|---|---|---|---|---|---|
| FG-CLIP Base | Apache-2.0 | FG-CLIP is a fine-grained visual-textual alignment model that achieves both global and region-level image-text alignment through two-stage training. | Text-to-Image, Transformers, English | qihoo360 | 692 | 2 |
| CLIP ViT-L-14 Spectrum Icons 20k | MIT | A vision-language model fine-tuned from CLIP ViT-L/14, optimized for abstract image-text retrieval tasks. | Text-to-Image, TensorBoard, English | JianLiao | 1,576 | 1 |
| ProLIP ViT-B-16 DC-1B 12.8B | MIT | A Probabilistic Language-Image Pretraining (ProLIP) ViT-B/16 model pretrained on the DataComp-1B dataset. | Text-to-Image, Safetensors | SanghyukChun | 460 | 0 |
| Jina CLIP v2 | – | A versatile multilingual multimodal embedding model for text and images, supporting 89 languages, higher image resolution, and Matryoshka (nested) representations. | Text-to-Image, Transformers, Multilingual | jinaai | 47.56k | 219 |
| CLIP GmP ViT-L-14 | MIT | A fine-tuned version of OpenAI's CLIP ViT-L/14 that improves performance through Geometric Parametrization (GmP), with particular attention to the text encoder. | Text-to-Image, Transformers | zer0int | 6,275 | 433 |
| PMC ViT-L-14 HF | – | A vision-language model fine-tuned from CLIP ViT-L/14 on the PMC-OA dataset. | Text-to-Image, Transformers | ryanyip7777 | 260 | 1 |
| CLIP ViT-B-16 DataComp.XL S13B B90K | MIT | A CLIP ViT-B/16 model trained with OpenCLIP on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval. | Text-to-Image | laion | 4,461 | 7 |
| Arabic CLIP ViT Base Patch32 | – | An adaptation of the Contrastive Language-Image Pre-training (CLIP) model to Arabic, learning concepts from images and associating them with Arabic text descriptions. | Text-to-Image, Arabic | LinaAlhuri | 33 | 2 |
| CLIP ViT-bigG-14 LAION-2B 39B B160K | MIT | A vision-language model trained with the OpenCLIP framework on the LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 565.80k | 261 |
| CLIP ConvNeXt-Base W LAION-2B S13B B82K AugReg | MIT | A CLIP model with a ConvNeXt-Base image tower, trained with OpenCLIP on a subset of LAION-5B, focused on zero-shot image classification. | Text-to-Image, TensorBoard | laion | 40.86k | 7 |
| Taiyi CLIP RoBERTa 102M ViT-L Chinese | Apache-2.0 | The first open-source Chinese CLIP model, pre-trained on 123 million text-image pairs, with a RoBERTa-base text encoder. | Text-to-Image, Transformers, Chinese | IDEA-CCNL | 668 | 19 |
| CLIP ViT-H-14 LAION-2B S32B B79K | MIT | A vision-language model trained with the OpenCLIP framework on the English LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 1.8M | 368 |
| CLIP ViT-L-14 LAION-2B S32B B82K | MIT | A vision-language model trained with the OpenCLIP framework on the English subset of LAION-2B, supporting zero-shot image classification and image-text retrieval. | Text-to-Image, TensorBoard | laion | 79.01k | 48 |